Start up

Download airbnb_nyc_2019.csv and Airbnb_analysis.R from the course website. Make sure that you save them in the folder that you have been using for this class!

Open up Airbnb_analysis.R and run the lines of code that you see there. Do you understand what the code is doing? Some of the comments have been left blank: fill them in with what you believe the code is doing.

The rest of this document will go through how we can convert the R script Airbnb_analysis.R into an R markdown document.

Creating an R markdown file

Let’s create an R markdown file where we will write up our data analysis. Click on the new-file icon in the top-left corner of the window and select “R Markdown…”. In the window that pops up, select “Document” in the sidebar on the left. Type in “Analysis of Airbnb Data in NYC 2019” for Title, and your name for Author. For “Default Output Format”, select HTML. Click “OK”.


Upon clicking “OK”, a new sub-window appears in the top-left of our RStudio window with some default text. Notice how the filename is “Untitled1”? Save the document in our class folder with the name “Airbnb analysis”. The filename in the window will become “Airbnb analysis.Rmd”.


The top section, enclosed between two lines of ---, is called the YAML header (YAML originally stood for “yet another markup language”; the official expansion is now “YAML Ain’t Markup Language”). R markdown uses it to control many details of the whole document. We won’t talk much about this header in this class. Just notice that the “title” and “author” fields were automatically populated with what we entered in the earlier window, and that the date is the date when the document was created. You can change these fields by manually editing them here.

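To create the HTML document from this R Markdown file, click on the Knit button (or use the shortcut Cmd/Ctrl + Shift + K). A couple of things happen when you do this: a preview of the rendered document opens, and an “.html” file is saved in the same folder as the .Rmd file.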

(It’s possible that your preview shows up in the “Viewer” window in the bottom-right corner as well. To expand it to a new window, click the “Show in new window” button (on the right of the broom icon).)

If you open up the “.html” file in your web browser, you will see that it is the same as the preview.

Compare the contents of the .Rmd file with the preview that you see. Can you see how the markdown syntax (such as the ## before “R Markdown” and the asterisks surrounding “Knit”) gets styled in the final document?

Next, notice how code chunks are represented in the .Rmd file. They start with ```{r} and end with ```. The word after r (e.g. cars, pressure) is the name of the code chunk. If you scroll through the “R Markdown” tab in the bottom-left window, you’ll see these names pop up. Code chunks don’t need to have a name. After the name of the code chunk, you may see options like echo=FALSE or include=FALSE. We’ll talk about these as we go along.

Finally, notice that our environment is empty (see the “Environment” tab in the top-right window). When we knit a document, R essentially starts a new session/environment and runs all the code there, so nothing from the knit ends up in our interactive environment.
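Knitting can also be triggered from the console. The sketch below is a minimal illustration, assuming the rmarkdown package is installed and that the file was saved as “Airbnb analysis.Rmd” in the working directory. Note one difference from the Knit button: by default render() evaluates code in the environment it is called from, so passing envir = new.env() only roughly approximates the separate session described above.

```r
library(rmarkdown)

# File name as saved earlier in this lab; adjust if yours differs
rmd_file <- "Airbnb analysis.Rmd"

if (file.exists(rmd_file)) {
  # render() knits the document and returns the path of the output file.
  # envir = new.env() keeps objects created by the chunks out of the
  # global environment.
  html_file <- render(rmd_file,
                      output_format = "html_document",
                      envir = new.env())
  print(html_file)
} else {
  message("Could not find '", rmd_file, "' in ", getwd())
}
```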

Airbnb analysis

To illustrate how to use R markdown for presenting data analyses, we will work through a case study on Airbnb listings in New York City (NYC) in 2019. The data analysis is meant to be illustrative, not comprehensive.

As a start, delete everything in the .Rmd file except the YAML header and the first code chunk (the one that has {r setup, include=FALSE} at the top).

Set-up

The code in the chunk labeled “setup” sets global options for all code chunks to follow. Setting echo=TRUE means that the code in every subsequent chunk is printed in the published document, along with its results. (If we set echo=FALSE instead, we will not see the code in the published document. However, the code is still run and its results are still shown.)
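As a small illustration, the line inside the setup chunk can also be run in the console, and the current global default can then be queried. This is only a sketch: it assumes the knitr package is installed, and it changes the option for the current R session only.

```r
# The line that RStudio's default template places in the setup chunk
knitr::opts_chunk$set(echo = TRUE)

# Query the current global default for the echo option
knitr::opts_chunk$get("echo")
```

Any chunk that does not set echo itself will use this value.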

Introduction

It’s always a good idea to have an introduction section to your data analysis. Type the following below the setup code chunk:

## Introduction

This is an analysis of Airbnb listings in New York City (NYC) in 2019. The data was taken from https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/.

Data import and checking

Next, let’s create a code chunk to import the libraries that we will use. It’s hard to know exactly which packages we are going to use in advance, but we can always go back to this chunk and amend it later. There are a number of ways to create a code chunk:

  • Manually type ```{r}, followed by your code, then close the chunk with ```,
  • Click the insert-chunk button in the editor toolbar, then select “R”, or
  • Use the Cmd/Ctrl + Alt + I shortcut.

After creating the code chunk, type the following line in the chunk:

library(tidyverse)

Next, create another code chunk to read in data:

df <- read_csv("airbnb_nyc_2019.csv", 
               col_types = cols(host_id = col_character(), 
                                id = col_character(), 
                                last_review = col_date(format = "%Y-%m-%d")))
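To see what the col_types argument does, here is a minimal sketch that parses a tiny CSV string with the same column specification. The two rows are invented for illustration and are not taken from the real dataset; only the column names id, host_id, last_review and price come from the analysis.

```r
library(readr)

# Two made-up rows, NOT taken from airbnb_nyc_2019.csv.
# A string containing newlines is treated by read_csv() as literal data.
toy_csv <- "id,host_id,last_review,price\n1001,5001,2019-01-15,100\n1002,5002,2019-06-30,250\n"

toy <- read_csv(toy_csv,
                col_types = cols(host_id = col_character(),
                                 id = col_character(),
                                 last_review = col_date(format = "%Y-%m-%d")))

# Check the type each column was parsed as; columns not listed in
# cols() (here, price) have their type guessed from the data
sapply(toy, class)
```

Reading the ID columns as character avoids treating identifiers as numbers, and parsing last_review as a date makes date arithmetic possible later on.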

Knit the document to see what it looks like at this point. See how there are a whole bunch of messages after the library(tidyverse) line? While informative when doing our data analysis, they are probably something we don’t want to present. To remove these messages (along with all future messages and warnings), go to the setup code chunk and amend knitr::opts_chunk$set(echo = TRUE) to knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE). If we knit the document now, we’ll see that the tidyverse messages are no longer there.

Next, create another code chunk and put in the lines of code which give us a feel for the data:

dim(df)
head(df)
names(df)

The knitr package provides us with a function, kable(), that helps print datasets more nicely in R markdown files. Add library(knitr) to the library imports chunk, and change head(df) to kable(head(df)). Knit the document again to see the difference.

To orient our reader, we may want to add some text before that code chunk along the lines of “the dataset contains the following columns:”.
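For a quick feel of what kable() does, the sketch below applies it to a tiny data frame. The data frame and its column names are made up for illustration and do not come from the Airbnb data.

```r
library(knitr)

# Made-up rows for illustration only
toy <- data.frame(
  listing = c("A", "B", "C"),
  price   = c(100, 250, 80)
)

# kable() returns a table object; in a knitted document it is rendered
# as a formatted table instead of raw console output
tbl <- kable(head(toy, 2))
tbl
```

Run in the console, this prints the table as plain text; the nicer styling only appears once the document is knitted.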

R in-line

So far, all the R code has been in chunks. It is possible to have R code within the text itself too! For example, instead of dedicating an R chunk for nrow(df) and ncol(df), we could have the following line outside an R code chunk:

The dataset has `r nrow(df)` rows and `r ncol(df)` columns.

When you knit the document again, notice how the command nrow(df) is run and the output is printed (instead of the code itself).
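Any R expression that returns a single value can be used in-line, which is handy for formatting numbers. As a sketch, an in-line expression such as `r format(nrow(df), big.mark = ",")` would print the row count with a thousands separator. The snippet below shows the same call on a stand-in number, since the actual count depends on the file you downloaded.

```r
# Stand-in value for nrow(df); not the real row count
n_rows <- 12345

# format() with big.mark inserts a thousands separator and returns a string
formatted <- format(n_rows, big.mark = ",")
formatted
# [1] "12,345"
```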

Data analysis and conclusion

We can repeat the process above for the rest of the code in Airbnb_analysis.R:

  1. Create a new code chunk.
  2. Copy and paste a small, digestible piece of code from the R script into the chunk. Usually we end the chunk when we want to print some output to screen, or when we’ve completed a logical step in the data analysis.
  3. Add suitable text and headers before/after the code chunk to explain what is going on.

After the data analysis, you should end with a conclusion section. This can just be a summary of the results presented, or it could also include takeaway lessons, limitations of the analysis and/or future directions.

You can find a complete version of this Airbnb analysis (both .Rmd and .html file) on the course website.

Optional material

R chunk options

We can specify “options” for each R chunk to change how the output looks. For example, the chunk below makes a histogram of log10(price).

ggplot(df) +
    geom_histogram(aes(x = log10(price)))

We may want to change the size of the figure (e.g. for a different aspect ratio, or to save space). The way to do that is to replace the ```{r} at the top of the chunk with ```{r fig.width=6, fig.height=3}:

ggplot(df) +
    geom_histogram(aes(x = log10(price)))

(The default values are fig.width=7 and fig.height=5.)

Remember the setup chunk right at the top of the R markdown document? If a particular code chunk does not have any options specified, it will follow whatever is in the setup chunk.

Here are some commonly used options:

  • include = FALSE: prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
    • Useful for decluttering your Rmd output, showing only essential code.
  • echo = FALSE: prevents code, but not the results, from appearing in the finished file.
    • Useful if you just want to show figures but not the code that generated them.
  • eval = FALSE: Code appears in the output but is not run.
    • Useful for presenting code for demonstration purposes.
  • message = FALSE: prevents messages that are generated by code from appearing in the finished file.
    • Useful for suppressing messages when loading packages.
  • warning = FALSE: prevents warnings that are generated by code from appearing in the finished file.
    • Useful for suppressing warnings when loading packages, plotting data or fitting models.
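To see the difference between some of these options without creating a whole document, the sketch below knits a one-line chunk three times with different options and prints the resulting markdown. It assumes the knitr package is installed and is meant to be run in the console, not inside a chunk of a document that is itself being knitted. The helper knit_with() is a hypothetical name introduced here for illustration; it is not part of knitr.

```r
library(knitr)

# Build the chunk fence programmatically (three backticks)
fence <- strrep("`", 3)

# Hypothetical helper: knit a chunk that computes 1 + 1 under the given
# chunk options and return the markdown output as a string
knit_with <- function(options) {
  rmd <- c(paste0(fence, "{r, ", options, "}"), "1 + 1", fence)
  knit(text = rmd, quiet = TRUE)
}

out_echo    <- knit_with("echo=FALSE")     # result shown, code hidden
out_eval    <- knit_with("eval=FALSE")     # code shown, but not run
out_include <- knit_with("include=FALSE")  # code run, nothing shown

# Print the three outputs, separated by a divider line
cat(out_echo, out_eval, out_include, sep = "\n---\n")
```

Comparing the three outputs side by side makes it easier to remember which option hides the code, which hides the results, and which hides both.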

Workflow

In this lab, we started with a working R script, then converted that R script into an R markdown document. While tedious, this is a great way to create R markdown documents as it ensures that the code itself is working.

When you are more familiar with R, you can also start writing R markdown documents from scratch, typing in the code as you go. The only trouble there is that, to check that your code is working, you have to knit the document after writing each chunk to see if you got the result you wanted.

One way to speed up the process of writing an .Rmd file is to run the code in the Console instead. There are 3 ways to do this:

  • Copying and pasting the code from the .Rmd file into the console and pressing Enter,
  • Highlighting the code and clicking the Run button or using the Cmd/Ctrl + Enter shortcut, or
  • Pressing the “play” button in the top-right corner of the chunk, which runs all the code in that chunk.

Mimicking the knitting process in the console in this way lets us check that the code chunks evaluate to the result we want without knitting over and over again.

Session info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
## [5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
## [9] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       cellranger_1.1.0 pillar_1.4.2     compiler_3.6.1  
##  [5] tools_3.6.1      zeallot_0.1.0    digest_0.6.20    lubridate_1.7.4 
##  [9] jsonlite_1.6     evaluate_0.14    nlme_3.1-140     gtable_0.3.0    
## [13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.4.0      cli_1.1.0       
## [17] rstudioapi_0.10  yaml_2.2.0       haven_2.1.1      xfun_0.9        
## [21] withr_2.1.2      xml2_1.2.2       httr_1.4.1       knitr_1.24      
## [25] vctrs_0.2.0      generics_0.0.2   hms_0.5.1        grid_3.6.1      
## [29] tidyselect_0.2.5 glue_1.3.1       R6_2.4.0         readxl_1.3.1    
## [33] rmarkdown_1.15   modelr_0.1.5     magrittr_1.5     backports_1.1.4 
## [37] scales_1.0.0     htmltools_0.3.6  rvest_0.3.4      assertthat_0.2.1
## [41] colorspace_1.4-1 labeling_0.3     stringi_1.4.3    lazyeval_0.2.2  
## [45] munsell_0.5.0    broom_0.5.2      crayon_1.3.4